Machine learning-enabled prediction of prolonged length of stay in hospital after surgery for tuberculosis spondylitis patients with unbalanced data: a novel approach using explainable artificial intelligence (XAI)

Background Tuberculosis spondylitis (TS), commonly known as Pott’s disease, is a severe type of skeletal tuberculosis that typically requires surgical treatment. However, this treatment option has led to an increase in healthcare costs due to prolonged hospital stays (PLOS). Therefore, identifying risk factors associated with extended PLOS is necessary. In this research, we intended to develop an interpretable machine learning model that could predict extended PLOS, which can provide valuable insights for treatments and a web-based application was implemented. Methods We obtained patient data from the spine surgery department at our hospital. Extended postoperative length of stay (PLOS) refers to a hospitalization duration equal to or exceeding the 75th percentile following spine surgery. To identify relevant variables, we employed several approaches, such as the least absolute shrinkage and selection operator (LASSO), recursive feature elimination (RFE) based on support vector machine classification (SVC), correlation analysis, and permutation importance value. Several models using implemented and some of them are ensembled using soft voting techniques. Models were constructed using grid search with nested cross-validation. The performance of each algorithm was assessed through various metrics, including the AUC value (area under the curve of receiver operating characteristics) and the Brier Score. Model interpretation involved utilizing methods such as Shapley additive explanations (SHAP), the Gini Impurity Index, permutation importance, and local interpretable model-agnostic explanations (LIME). Furthermore, to facilitate the practical application of the model, a web-based interface was developed and deployed. Results The study included a cohort of 580 patients and 11 features include (CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, postoperative drainage) were selected. Most of the classifiers showed better performance, where the XGBoost model has a higher AUC value (0.86) and lower Brier Score (0.126). The XGBoost model was chosen as the optimal model. The results obtained from the calibration and decision curve analysis (DCA) plots demonstrate that XGBoost has achieved promising performance. After conducting tenfold cross-validation, the XGBoost model demonstrated a mean AUC of 0.85 ± 0.09. SHAP and LIME were used to display the variables’ contributions to the predicted value. The stacked bar plots indicated that infusion volume was the primary contributor, as determined by Gini, permutation importance (PFI), and the LIME algorithm. Conclusions Our methods not only effectively predicted extended PLOS but also identified risk factors that can be utilized for future treatments. The XGBoost model developed in this study is easily accessible through the deployed web application and can aid in clinical research. Supplementary Information The online version contains supplementary material available at 10.1186/s40001-024-01988-0.


Introduction
Spinal infection (SI) is a condition that occurs when the vertebral body, intervertebral disc, and surrounding paraspinal tissue become infected [1].Tuberculous spondylitis (TS), commonly known as Pott's disease, is a common cause of SI.While rarely seen in developed countries, it remains a significant public health concern in developing nations [2].TS accounts for approximately 50% of extrapulmonary musculoskeletal tuberculosis cases and is one of the most frequent forms of skeletal tuberculosis [3].Incidence rates of TS have increased in developing countries in recent years.Diagnosis of TS typically involves laboratory, radiology, and histopathology evaluations.Tuberculosis may lead to a kyphotic deformity of the spine associated with paralysis.The management strategy for tuberculous spondylitis (TS) is contingent upon the extent of the disease and the identification of neurological deficits through thorough clinical examination.Treatment for TS primarily involves adequate antituberculous drugs with or without surgical interventions to ensure sufficient decompression, debridement, and spinal stability [4].Nonetheless, surgical treatment rates for TS persist at high levels, particularly in developing countries [5].As a result, hospitalization-related healthcare expenses remain a significant healthcare burden for both patients and healthcare systems.
The escalation of healthcare costs in nearly all countries, particularly those that are developing, has emerged as a pressing social issue.One of the urgent challenges that require attention is the problem of extended hospital stays.The length of stay (LOS) after surgery, which denotes the duration of hospitalization, is widely considered a key indicator of efficiency and a proxy for the utilization of hospital resources [6].Moreover, understanding LOS can facilitate improved comprehension of patient flow patterns, which are essential for comprehending the operational and clinical functions of healthcare systems [7].More critically, diminishing LOS can potentially contribute to mitigating costs and enhancing patient outcomes.Consequently, there is a growing interest in developing robust predictive models of LOS.However, previous researches mainly focus on impacts of the degenerative diseases or the procedure approach selection on the LOS after surgery, resorting to conventional statistical analysis that was inferior to the performances of all other machine learning techniques when decisionmaking in clinical practical [8,9].
The duration of hospitalization, commonly known as length of stay (LOS), is associated with a range of health information, including diagnoses, lab results, clinical data, and demographics.The integration of multiple biological information sources into a statistical learning framework represents a valuable application of machine learning (ML) algorithms.ML models can handle a diverse array of data types, including sociodemographic information, symptoms, laboratory tests, radiology tests, and others, which can significantly enhance classification accuracy [10].Nonetheless, ML models often operate as black boxes, rendering the interpretation of their predictions difficult, particularly for complex models.To mitigate this limitation, several explainable artificial intelligence (XAI) techniques have been introduced to elucidate how these models function [11].An effective approach involves employing the additive feature attribution method, wherein adjustments are made to input variables to evaluate the individual contribution of each feature concerning the model's predictive outcomes.Given the critical nature of clinical decisions, understanding how models make predictions is paramount.Toward this end, the local interpretable model-agnostic explanation (LIME) approximates black-box models with locally generated surrogate models [12].Moreover, the Shapley additive explanation (SHAP), a recent XAI technique, offers a more comprehensive perspective of each feature's contribution to predicting clinical outcomes [13].Applying SHAP to our analysis, we identified the features with the greatest predictive power to delineate between target and non-target groups.
The issue of extended hospitalization poses a significant challenge for patients residing in underdeveloped areas, necessitating the development of accurate predictive models that rely on readily available data sourced from routine primary care.To address this challenge, advanced data mining techniques, predominantly machine learning (ML), are being leveraged to identify the predictors associated with prolonged hospital stays.Essentially, these statistical algorithms are capable of learning patterns from nonlinear, high-dimensional data, while simultaneously considering the interactions between multiple relevant variables.This research aims to develop and implement advanced machine learning (ML) models to predict which patients may experience prolonged hospitalization following surgery.Our methodology seeks to address limitations present in conventional prediction models through the incorporation of more sophisticated and advanced ML techniques.In addition, we aim to deploy our predictive tool as an online resource accessible to all interested parties free of charge.

Cohort and study design
The Ethics Committee of the Hospital granted approval for this retrospective cohort study, in which researchers examined the medical files of 580 individuals diagnosed with tuberculous spondylitis.The study specifically targeted individuals who underwent posterior decompression and instrumentation surgery at our medical center during the period from January 1, 2016, to December 31, 2022.To adhere to best practices and current regulatory legislation for data privacy, a thorough de-identification procedure was carried out prior to conducting any data analysis.The diagnoses of tuberculosis spondylitis (TS) relied upon meticulous clinical evaluations, laboratory findings, and radiographic evidence such as X-ray, computed tomography, and magnetic resonance imaging.Postoperative confirmation of TS diagnoses was accomplished through histopathological examination and/ or the detection of acid-fast bacilli using Ziehl-Neelsen staining, and/or the growth of Mycobacterium tuberculosis in culture specimens obtained from bone marrow, abscesses, or tissue [14].Inclusion criteria were: (1) age ≥ 18 years; (2) patients present with a range of symptoms, including but not limited to continuous or unremitting local pain in the neck or back, fever, chills, weight loss, malaise, neurological symptoms, or limited range of motion in the spine; (3) complete information of lab test; (4) complete records of medical imaging examination consisting of preoperative imaging modalities including X-ray, computed tomography (CT), and magnetic resonance imaging (MRI) were performed prior to the initiation of surgical therapy; (5) confirmation the diagnosis using the gold standard (microbiology/histopathology examination); and (6) we conducted a surgical procedure involving posterior lumbar spine decompression, debridement, and stabilization in the affected area.To mitigate potential confounding variables associated with multiple procedures, we included only one surgical intervention for each patient who may have undergone several eligible spine operations.More specifically, this study encompassed patients who underwent the procedure of posterior lumbar spine decompression, debridement, and stabilization in the lesion region.

Candidate variables in the models
This study utilized a comprehensive range of five distinct categories of features commonly found in electronic health records (EHRs).These categories encompassed: (1) demographic information such as age, sex, and marital status; (2) patient symptoms and medical history; (3) preoperative laboratory data and imaging features derived from X-rays, computed tomography (CT) scans, and magnetic resonance imaging (MRI); (4) surgical factors including operation duration, intraoperative blood loss, and the need for transfusions; and (5) postoperative details such as the extent of drainage from the surgical site.All images used in this study underwent review and analysis by a chief physician who was unaware of the clinical and laboratory results.It is worth noting that an expert physician meticulously evaluated all images used in this study, with no knowledge of the clinical and laboratory findings.Since patients often undergo multiple biological and imaging tests throughout their treatment, the data used for the prediction models were based on the admission data, representing the initial test value or report for each specific test.

Target outcomes class imbalance
The primary objective of this study was to analyze the duration of hospital stays (referred to as length of stay or LOS) among participants who underwent surgery.Specifically, it aimed to examine hospital stays beyond the 75th quartile of LOS.A total of 580 participants were included in the study, with 127 demonstrating a normal LOS.This substantial difference in sample size resulted in an imbalance between participants with normal and prolonged LOS.To address this issue, the Synthetic Minority Oversampling Technique (SMOTE) was employed to oversample the minority class, namely, the group with prolonged LOS, in the training set [15,16] (Fig. 1).To ensure the robustness and generalizability of each model, both feature selection and data balancing processes were limited to the training set.
SMOTE sampling method is a popular technique utilized for resampling data.The approach involves selecting a limited number of samples as the basis for generating an augmented dataset.From this pool of selected samples, new samples are created by randomly selecting auxiliary samples from k neighboring samples of the same category, based on sample multiplicity [17].The process is then repeated n times to achieve a balanced and representative dataset that minimizes overfitting [18].

Data pre-processing
Prior to data separation, the data set was divided into training (70%) and test (30%) sets.The stratification of the data sets was based on the outcome variable, ensuring that the proportions of positive and negative samples, specifically normal and prolonged length of stay, remained consistent between the training and testing sets, as well as the overall sample.
The transformation of raw electronic health record (EHR) data into machine-learning suitable variables is a noteworthy aspect.Pain levels were categorized into two groups, namely, moderate (VAS ≤ 5) and severe (VAS > 5), using the visual analog scale (VAS).In addition, during the patient's initial visit, the recorded fever grade was classified as either low (< 38.5 °C) or high (≥ 38.5 °C).Furthermore, the definition of severe vertebral destruction encompassed damage affecting one-third or more of the vertebrae.
Following data separation, missing variable imputation and normalization were separately conducted on both the training and testing sets to avoid data leakage [19].

Fig. 1 Pipelines for the whole research
To normalize the numeric variables, the z-score method was applied.This involved subtracting the mean value of the sample from each individual value and then dividing the difference by the sample standard deviation.To standardize the features, the standard scaler was fitted to the training set by removing the mean and scaling to unit variance.Subsequently, the same scaler was applied to transform the data in both the training and test sets.
The input parameter values and external model configurations, known as hyperparameters, were defined using the training set to optimize predictive accuracy.Conversely, the test set consisted of new data that was not used during algorithm training, enabling the evaluation of model performance on previously unseen data.
To impute missing variables, we employed the multiple imputations by chained equations (MICE) technique [20].

Predictors selection
Given the potential challenges posed by noisy, redundant, and irrelevant variables in hindering the performance of learning algorithms, it is imperative to employ effective feature selection techniques.In this study, we employed several methods such as variance thresholding, supper vector machine classification (SVC)-based recursive feature elimination (SVC-RFE), Pearson correlation analysis, LASSO, and permutation importance ranking to address these challenges.Specifically, we utilized the Pearson correlation analysis to eliminate highly correlated variables, while variables with Pearson correlation coefficients above 0.90 were considered to exhibit low quality and high redundancy and were therefore excluded from the analysis altogether.The LASSO method, a supervised machine-learning technique and a penalized regression model capable of enhancing prediction accuracy and interpretability of the statistical model, was used to select the most useful predictive features in high-dimensional variables [21].By means of discarding confounding variables and minimizing the coefficients of comparatively irrelevant variables to 0 using a tenfold cross-validation technique in the training data set, we were able to construct a statistically significant model with superior performance metrics.
Recursive feature elimination (RFE) is a technique used to select features by iteratively assessing smaller and smaller subsets based on weighted assignments from an external estimator [22].A defined property or callable function is used to determine the relevance of each feature after training the estimator with a starting set of characteristics.The least important characteristics in the present collection are then removed.The trimmed set is then subjected to this recursive approach again until the desired number of chosen features is obtained.In this study, the least significant characteristics were determined using the SVC-RFE approach and fivefold crossvalidation [23].
Automated methods for choosing pertinent characteristics are used in machine learning (ML), some of which are based on built-in measurements of varying relevance.The Gini Importance (MDI: Mean Decrease in Impurity) and the Permutation Importance (MDA: Mean Decrease in Accuracy) are two often used metrics [24,25].These measures rank the relevance of features concerning their utility in predicting outcomes.The importance of features is defined as their contribution towards improving the overall predictive performance of the model.The Gini Importance index is an example of a measure that can be derived from the Gini index, which is used to assess node impurity.Nonetheless, relying on the Gini Importance index to evaluate feature relevance may lead to biased results due to its use of the Gini-gain splitting criterion, which can be influenced by selection bias when analyzing factorial data [26].
Breiman first proposed Permutation feature importance metrics for feature selection in Random Forests, which were later refined by Altmann to address potential biases associated with the Gini Importance and entropy criterion [27].Consequently, Permutation variable importance has emerged as a more reliable approach for selecting features, particularly when working with continuous and highly categorical variables [28].The Permutation variable importance score is calculated using two widely used ensemble tree-based machine learning algorithms-random forest (RF) and XGBoost in this research.Features with higher Permutation variable importance scores are generally more strongly related to predicting prolonged length of stay (LOS).
To improve the efficiency of our feature selection process, we generated an intersection of the feature datasets based on the aforementioned methods.The resulting set of selected features was then employed during the machine learning training phase.

Candidate model frameworks and hyper-parameters tuning
In this study, several algorithms were explored for classification tasks.Naive Bayes (NB) is a supervised probabilistic machine learning algorithm that relies on Bayes' Theorem and assumes strong independence between input features [29].Logistic regression (LR) is a predictive machine learning algorithm with a complex cost function and does not assume feature independence [30,31].K-Nearest Neighbor is a popular algorithm for pattern recognition that predicts the new query instance's category based on the majority of k-nearest neighbor categories [32].Random forest is an ensemble learning method that addresses the issue of overfitting in decision trees by constructing multiple decision trees during the training process [33].Support vector machines is a powerful machine learning algorithm that learns well with small parameters and is robust against various model violations [34].A decision tree is a hierarchical framework that splits independent variables recursively into homogeneous zones [35].XGBoost is an optimized distributed gradient-boosting system that has been specifically designed for high efficiency, flexibility, and portability [36].LightGBM is a tree-based machine learning algorithm that utilizes gradient boosting learning and histogram-based algorithms and grows the tree leafwise [37].
To enhance the predictive accuracy of certain categories while limiting the incidence of false positives, our research team conducted a thorough investigation into the collective efficacy of various machine learning models.Our primary objective was to generate optimal predictions that would yield the best possible outcomes.To accomplish this goal, we implemented a technique known as the voting model or majority voting ensemble, in which we combined the results generated by multiple machine learning algorithms.The purpose of this approach was to identify the most prevalent prediction, which would then serve as the final outcome.The voting model approach is particularly beneficial in cases where multiple models perform well at predicting certain categories, a scenario that was observed in our research.By utilizing this technique, we aimed to achieve superior performance compared to individual models.The voting model approach encompasses two methods: hard voting and soft voting.Hard voting involves combining the predictions for each label and selecting the one with the highest number of votes.In contrast, soft voting involves summing up the probabilities assigned to each class and selecting the prediction with the highest probability.In our research, we utilized the soft-voting method to ensure the model's consistent contributions across all categories.
The investigation comprised a training procedure that made use of grid-search and layered cross-validation.An outer loop for selecting the test set in each fold and an inner loop for further segmenting the data into training and validation sets made up the nested cross-validation technique.The dataset was first divided into train and test sets for the outer loop.Grid-search was used within the inner loop to do k-fold cross-validation on the train set to find the ideal set of hyperparameters from a predetermined list [38].For each optimization trial, the data was divided into training and validation sets, followed by the implementation of inner k-fold cross-validation.The aim was to maximize the overall accuracy from the k-fold validation, to determine the optimal combination of hyperparameters.This optimal combination was then used to train the final model using all available training samples.Subsequently, the resulting model was tested on the outer fold of the test set.This entire procedure was repeated for all outer folds, and the evaluation metrics were computed based on the aggregated test results (Fig. 1).

Model performance assessment and comparison
Evaluation and comparison of the predictive model's performance are crucial in the modeling process after constructing and training the model.For this study, various standard performance indicators were utilized to assess the classifiers' performance.These indicators include accuracy, specificity (also known as recall), sensitivity, negative predictive value (N.P.V.) (also known as precision), positive predictive value (P.P.V.), area under the receiver-operating characteristic curve (AUC), F1-Score, Log-Loss, and Brier-Score.To evaluate calibration, the Brier score was used, with scores ranging from 0 to 1, where lower scores indicate better calibration.Discrimination was measured by the area under the receiver operating characteristic curve (ROC-AUC).A value of 0.50 is used as a reference for a random estimator, while a value of 1.0 is used as a reference for a model with perfect discriminant ability.These metrics were computed using the binary confusion matrix described below.
In our study, patients with prolonged length of stay (LOS) were classified as having prolonged LOS if correctly detected as true positive (TP), while those with normal LOS were classified as true negative (TN).On the other hand, false positive (FP) occurred when a patient with normal LOS was falsely detected as having prolonged LOS, and false negative (FN) was when a patient with prolonged LOS was incorrectly categorized as having normal LOS.We evaluated the accuracy of the predictive models by computing performance evaluation matrices involving TP, TN, FP, and FN classifications for both prolonged and normal LOS outcomes.
AUC and Brier score, which were kept for additional study, were the key criteria used to choose the best model.During the training stage, we assessed the ROC-AUC values for each stage of the fivefold cross-validation method.We also showed the calibration plot and the decision curve analysis curves to help you assess the performance of the model you choose.

Model explanation
Feature selection techniques that estimate the relevance of variables are limited in their ability to describe patterns of dependency between features and responses.Instead of providing a comprehensive understanding, they merely convey the intensity of this dependency as a single number, making the resulting findings difficult to interpret.To overcome this limitation, we utilized two procedures, namely, Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME), to explain the selected model and assess the contribution of each variable to its performance.These procedures provide valuable insights into the predictive importance of each variable and the direction of their values [12,39] (Fig. 1).
The SHAP approach can make it easier to comprehend the chosen model.Important characteristics in the model are those with a high SHAP value, whereas those with a low SHAP value have a detrimental influence on the model's output [40].On the other hand, characteristics with a positive value support the desired result (prolonged LOS).To illustrate the significance and direction of each predictive predictor, plots were created.Each predictive variable's location on the y-axis was arranged according to its relative relevance, with the most significant predictors at the top.The location of each point on the x-axis for each predictive variable signified the contribution of each participant to the total value of the Shapley additive explanations, with significant positive contributions on the far right (red signifying greater values or the existence of binary components).
To explain model predictions, the local interpretable model-agnostic explanation (LIME) technique is used.The rationale behind this approach is based on the use of a local linear approximation of the model behavior to predict a single sample, which can be more confidently trusted.LIME is a widely recognized tool proposed by Ribeiro et al. [41] that specifically aims to help explain the decision-making processes of complex black-box models.LIME is used to create a new dataset by randomly disrupting the samples, which is then used to train a linear model that locally approximates the black-box model [13].In doing so, LIME enables us to obtain insights into the local decision-making behaviors of the black-box model using an interpretable model.

Web application deployment
The final model was deployed as a web application.With the growing number of scientific findings emerging from omics studies, there is a demand for novel translational medicine applications and bioinformatics tools.These tools aim to facilitate the integration of these findings into clinical practice, bridging the gap between laboratory research and patient care.The translation of knowledge from bench to bedside not only encourages the development of new biotechnological products but also contributes to improving patients' health [42].

Statistical analysis
When comparing categorical variables, the appropriate Chi-square or Fisher's exact test was used to present the data as numbers and percentages.Continuous variables that met the criterion for normal distribution were described as mean values accompanied by their standard deviations [mean (SD)], and compared using the two-tailed Student's t test.Alternatively, median values along with their interquartile ranges [median (IQR)] and Wilcoxon-Mann-Whitney U test were applied to continuous variables that did not meet this normality assumption.Statistical significance was defined as a two-sided P value of 0.05.Graphs presented in this article were constructed using Microsoft Excel, R Studio 4.1.2(R Development Core Team), and Python version 3.9.7.

Baseline characteristics
The data set comprises 580 participants, out of which 21.9% (127) experienced a prolonged length of stay (LOS).Among the participants, 403 were men (70.2%) and 173 were women (29.8%).The participants were 47.2 years old on average.Twenty-two percent, or 117 people, had a history of TB.Older age was associated with prolonged LOS among the patients in the TS group.Further details can be found in Table 1.The proportion of minority samples oversampled by the SMOTE algorithm can be found in supplementary files.

Feature selection
As superfluous information may leave the classifications of LOS unsatisfactory, based on the LASSO method, 26 meaningful variables were filtered out, as shown in Fig. 2A, B. Furthermore, permutation importance calculated by fitting RF and XGBoost algorithms was also taken into consideration (Fig. 2C, D).Firstly, we performed SVC-RFE with fivefold cross-validation to determine the optimal subset of relevant features with minimal redundancy (Fig. 2E).Upon analyzing the plot, it becomes evident that the increase in the number of features beyond 10 does not result in significant elevation.Therefore, we selected ten variables as the most suitable and pertinent number, ensuring the inclusion of only the most relevant variables (Fig. 2F).Regarding the clinical practice and clinical relevance, and combining the results of the above screening procedure.The final training and testing dataset was composed of 11 variables, which are CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, postoperative drainage.To alleviate the issue of variable collinearity, we also examined the correlation between the variables in the dataset using three feature selection methods.First, we generated a heatmap to visualize the correlation coefficients and associated P values of the variables.As depicted in Fig. 3, all the coefficients were less than 0.6, although some variables exhibited statistical significance (P < 0.05).This implies that there is some correlation between certain variables, but the level of correlation is weak (indicated by small correlation coefficients, i.e., less than 0.6).

Model establishment and evaluation
As illustrated in Fig. 4, we compared both nested and non-nested cross-validation approaches when training our models.It can be discerned that the non-nested method will lead to an over-optimistic performanceassessment value than the nested method, which has prone to over-fitting.Therefore, nested-cross validation was adopted in our research.LGB + NB + LR + DT (AUC = 0.84, Brier Score = 0.139) (Fig. 5).What is more, we also calculated other metrics to measure the comprehensive capability of each algorithm.Brier score and log loss were given more attention.

Fig. 3 Heatmap of the correlation coefficient of features
The Brier score was used to measure the accuracy of the predictions using the predicted probabilities in their continuous form; the model with the lowest Brier Score is considered to have the greatest predictive performance.Compared with other models, the XGBoost model achieved the best performance with the highest AUC of 0.86 and the lowest Brier score of 0.126 in the testing group.More detailed information about each model's results can be found in in Table 2. Thus, the XGBoost model was maintained for further analysis, interpretation, and visualization.The prediction results of selected model by different groups can be found in supplementary files.
To further explore the generic discrimination ability of our model.It was evaluated using twofold crossvalidation, where it displayed mean AUC was 0.85 ± 0.09 which is almost equal to the AUC value in the validation set (Fig. 6).We also examined the clinical practical usefulness of the XGBoost model via calibration plot and decision analysis curve (Fig. 7).

Model interpretation of selected predictors
As data are entered and choices are made without a clear grasp of the process between them, machine learning models are frequently criticized for being opaque.To solve this problem, we used the Shapley additive explanations (SHAP) technique, which makes use of SHAP values to evaluate the interpretability of the model by providing a visual representation of the impact of each chosen predictor.The purpose of Fig. 8A, which represents the likelihood of extended LOS, is to evaluate the unique influence of each predictor by the size of its value (coded by a gradient of colors) and tendency direction on the horizontal axis.Infusion volume, transfusions, and MRI epidural abscess are shown to provide significant contributions to the model's output, but X-ray bone bridge and X-ray and MRI-paravertebral abscess have a moderate to low influence.
We generated a SHAP summary plot for the clinical outcome to illustrate the importance and direction of each predictive variable.The variables are sorted by their relative importance on the y-axis.Using the best prediction model, we calculated the SHAP value for each feature and represented them in violin charts, indicating their direction and significance to the overall outcome risk.The color on the plot represents the significance of each variable, with red indicating a higher predictor value and blue representing a lower value.For categorical variables, red denotes the presence of a specific category of the predictor, while blue indicates its absence.On the x-axis, each data point's position reflects the contribution of individual participants to the overall SHAP value, with high positive contributions on the far right.These findings provide insights into both patient-level and cohortlevel risk factors (Fig. 8B, C).As depicted in Fig. 8B, the top five risk factors contributing to prolonged LOS are infusion volume, transfusions, MRI epidural abscess, C-reactive protein (CRP), and CT vertebral destruction.
Stacked bar plots showing the relative significance and distribution of the selected variables, using Gini, permutation importance (PFI), and LIME algorithm, with ranks ordered from bottom to top by their decreasing  proportion of relative importance (Fig. 9).Similarly, infusion volume, CRP, CT-vertebral destruction, and transfusions are still the top features among most combinations of algorithms and important calculation strategies.We also used the SHAP decision plot and Waterfall plot to display the effect of each of the variables for individual predictions (Fig. 10).The decision plot shows the how XGBoost model arrives at its predictions regarding two patients (Fig. 10A).The assessment of a single patient might be interpreted using the waterfall plot.Every prediction began with the base value (0.514), which was the average SHAP value of all forecasts, and it represented the SHAP value of a characteristic as a force to raise or reduce the assessment.How much (in percentage) a certain characteristic contributed to the SHAP value depended on the length of the arrow.The color of the arrow indicated whether the contributions were constructive (red) or destructive (blue).
Elderly patients that experienced larger infusion volume, blood loss, transfusion, epidural abscess, and higher CRP underwent extended LOS as the patient with the higher SHAP value (i.e., 0.898) (Fig. 10B).However, the patient with the lower SHAP score (i.e., 0.038) did not have increased blood loss, had a normal amount of infusion volume, or underwent a transfusion (Fig. 10C).

Online bedside tool deployment
The final XGBoost model we chose was integrated into a web application (Fig. 11) that assesses a person's risk based on input predictors.It may determine the likelihood that TS patients would experience a longer LOS and display that information on calibration plots to define explicit visualizations.Online access to the web application is available at http:// 43.143.217.126: 9090/ tsplos.

Discussion
Our study focused on tuberculosis spondylitis (TS), which is a form of skeletal tuberculosis that primarily affects the spine.TS, also known as Pott's disease, is caused by the bacteria Mycobacterium tuberculosis and is characterized by the infection and destruction of the vertebral bodies and adjacent structures.The infection leads to inflammation and damage to the vertebral bones, intervertebral discs, and surrounding tissues, resulting in severe pain, deformity, and neurological complications.Spine surgery is often employed as a treatment option for patients with TS, aiming to alleviate pain, stabilize the spine, and prevent further neurological deterioration.However, this surgical intervention can be associated with prolonged hospital stays, which increase healthcare costs and pose challenges for patient management and resource allocation.Identifying factors that contribute to extended hospital stays after spine surgery in patients with TS is crucial for optimizing treatment strategies and improving patient outcomes.Therefore, in our study, we aimed to develop a machine learning model that could predict the likelihood of extended hospitalization, Fig. 8 A graphic representation of the impact of predictors on the outcome is depicted using the SHapley Additive exPlanations (SHAP) technique.Firstly, a SHAP bar plot (A) shows the features in descending order from top to bottom, with the most influential risk factors at the top.Secondly, a SHAP summary plot (B) presents lines representing predictors, arranged in decreasing order of importance for outcome prediction.The dots on each line represent the SHAP values corresponding to the predictors at each index checkup visit.The color and position of the dots indicate the strength and direction of the association.Specifically, red denotes a higher value (or presence of a specific category for categorical variables), while blue signifies a lower value (or absence of the category).In addition, dots on the left side of the chart indicate a decreased risk of the outcome, with greater leftward displacement indicating a higher impact on risk reduction.Conversely, dots on the right side represent an increased risk, with greater rightward displacement indicating a stronger impact on risk elevation.Lastly, a SHAP decision plot (C) provides an overview of how each critical parameter influences the final decision.Each colored line in the figure represents the predicted outcome for an individual patient providing insights into risk factors and aiding clinicians in planning postoperative care and resource allocation effectively.Our study successfully developed an interpretable machine learning model using XGBoost to predict extended hospital stays after spine surgery in patients with tuberculosis spondylitis (TS), commonly known as Pott's disease.We identified 11 key features, including CRP, transfusions, infusion volume, blood loss, X-ray bone bridge, X-ray osteophyte, CT-vertebral destruction, CT-paravertebral abscess, MRI-paravertebral abscess, MRI-epidural abscess, and postoperative drainage.The XGBoost model demonstrated superior performance with an AUC value of 0.86 and a low Brier Score of 0.126.By employing SHAP and LIME techniques, we were able to interpret the variables' contributions to the predicted outcomes.Our findings not only provide accurate predictions of extended hospital stays but also contribute to identifying risk factors for future treatments.The web Since the 1940s, when neural networks were first introduced, there has been an enormous surge in research on artificial intelligence (AI) and machine learning (ML).Computational resources, particularly in the last ten years, have gotten more affordable and available.Since more tools and libraries are now readily available, cutting-edge solutions like deep learning (DL) have emerged as a result of this development, democratizing machine learning techniques across a range of industries.For instance, DL techniques outperform conventional algorithms for the image or natural language processing, and they are frequently usable even by subject-matter experts without prior ML training [43,44].The conventional structure of a neural network consists of several layers connected by many nonlinear tangled interactions.It is impossible to fully understand how the neural network arrived at its choice, even if one were to examine all of these levels and characterize their relationships.There is growing worry that these black boxes may be biased in some way and that this bias may go unnoticed in several application domains.This can have significant ramifications, particularly in medical applications [45,46].Making medical decisions involves a lot of risk.It should come as no surprise that medical professionals are concerned about the black box aspect of deep learning.To address this problem, explainable artificial intelligence (XAI) has been proposed to allow people to understand and explain the decision-making process of machine Fig. 10 Personalized interpretation.Two patients from the testing set were shown in (A) SHAP decision plots; one was predicted to have "normal LOS" and the other to have "prolonged LOS".Important features in the XGBoost model correlate to features with a high SHAP value, whereas features with a low SHAP value have a detrimental effect on the model's output.On the other hand, characteristics with a positive value support the desired result (prolonged LOS).B, C SHAP waterfall diagrams for the two TS patients from our dataset who were selected at random.The red depicts the feature's beneficial effects, whereas the blue hue depicts its detrimental effects.Positive qualities led to the category of extended LOS, whilst negative elements contributed to the category of normal LOS learning models.XAI technology is a new and emerging field in AI that aims to explain how AI algorithms work and generate complex decisions by fundamentally investigating the causes and decision rules of AI algorithms [47][48][49].
In our research, we used a model explanation algorithm, and the result shows the identification and understanding of hazardous influences linked to prolonged hospital stays following surgery for tuberculosis spondylitis are crucial for enhancing patient outcomes and optimizing healthcare resources.CRP, a commonly used marker for inflammation, elevated CRP levels post-surgery often implies presence of infection or inflammation.In tuberculosis spondylitis patients, this could lead to delayed recovery and prolonged hospitalization [50].Although transfusions reduce blood loss after surgery, there are hazards associated with them, including infections and transfusion reactions.These issues may make it more difficult for the patient to heal, requiring extended hospital stays for care and observation [51].Proper fluid management is vital for maintaining organ function.
Both excessive and insufficient infusion volumes can potentially prolong hospitalization [52].Significant intraoperative blood loss can result in post-operative anemia and compromised tissue perfusion, delaying the patient's recovery.In addition, blood loss increases the risk of post-operative infections, further extending the hospitalization period.Exitance of X-ray bone bridge and osteophytes implies chronicity of the disease and may indicate a more advanced stage of tuberculosis spondylitis.These could make the surgical procedure more complex, subsequently result in prolonged hospitalization [53].CT-vertebral destruction indicates the extent of vertebral damage.Severe destruction of the vertebral bodies requires more extensive surgical interventions.This can prolong the surgery time and subsequently impact the length of stay in the hospital [54].CT-paravertebral abscess and MRI-paravertebral abscess refer to the presence of abscesses or collections of pus in the tissues surrounding the affected vertebrae.Treatment typically requires drainage of the abscess, which may require additional procedures and postoperative care, may lead to extended hospitalization.MRI-epidural abscess, which indicates the presence of abscess in the epidural space, where is the area around the spinal cord.Epidural abscesses can compress the spinal cord or nerve roots, causes neurological deficits.The involvement of the epidural space can significantly impact the length of stay in the hospital due to the need for close monitoring and management of neurological symptoms.Implementation of postoperative drainage may increase complications such as bleeding, infection at the surgical site, and catheter-related bloodstream infections, these may responsible for prolonged length of hospital stay for tuberculosis spondylitis patients.
The distribution of class unbalance is a typical issue for real-world medical data [55].The number of observations for each class or label is not equal-it is called unbalanced data, which is particularly common in the machine learning field [56].The unequal quality of the majority and minority classes is one of the key issues with employing data analysis for diagnosis and therapy, and it can present challenges for predictive modeling [57].The most common issue brought on by data unbalanced categorization is incorrect diagnoses of the minority class, which is more likely and cost-sensitive than the majority class [58].The accuracy of the general categorization is optimized by many machine learning methods.This design idea, however, leads to mistakes in the minority class's categorization [59].The method, in general, makes a bigger contribution to the higher quality classification of the majority class samples, which are thought to be more important.Standard machine learning with unbalanced data might result in a decision boundary that is heavily skewed by the majority class, resulting in low precision for the minority class.Ultimately, improving the classification performance for the minority class is the aim of tackling the unbalanced data problem.Unbalanced data can have significant effects on model output, one of which is that it may cause the model to give the dominant class priority over the minority class.The model thus exhibits a bias in favor of the majority class, resulting in poor results for the minority class and deceptively high overall accuracy.In addition, common assessment measures like accuracy could not correctly represent how well a model performs with an imbalanced dataset [60,61].Several strategies have been suggested to reduce the effect of unbalanced data on model output.The two primary types of these methods are algorithmic-level methods and data-level methods.Algorithmic level techniques change the learning algorithm itself to manage imbalanced data [62,63].On the other hand, the data-level balancing strategy has been frequently utilized to balance training samples across the available techniques [64].It has also been frequently used to balance data using a loss-based (cost-sensitive) method that gives minority samples more weight than majority ones.The classifierdesign strategy for balancing involves creating computational strategies that are integrated into a classifier to automatically address the class-imbalance problem [62].
To provide each class with a fairer representation, data-level procedures rebalance or expand the dataset.Undersampling the majority class by randomly omitting certain data is one strategy.This method shrinks the dataset but could potentially abandon valuable data.Another strategy is to generate fake samples or duplicate observations to oversample the minority class.SMOTE is the most used oversampling approach [16,63].SMOTE (Synthetic Minority Over-sampling Technique), is a popular data augmentation technique in oversampling.By interpolating between the feature space of two or more nearest nearby samples, SMOTE creates synthetic fake samples of the minority class.The difference between two feature vectors in the minority class is added to one or the other to create a new synthetic point if a point in the minority class is isolated.In this procedure, a minority class observation is chosen at random, and its k nearest neighbors are identified using feature space distance metrics such as the Euclidean distance or Manhattan distance [15,65].Compared to other sampling techniques, SMOTE has a number of advantages.Creating synthetic samples reduces the chance of overfitting the training data and is computationally efficient, because it can quickly process big datasets and create many synthetic examples.SMOTE does not introduce any noise to the dataset and also maintains the minority class's original distribution.SMOTE can also decrease the likelihood of the model overfitting to a few particular minority class data while increasing the generalizability and robustness of the model [66,67].
Artificial intelligence (AI) has accelerated the evolution of medical practice.Recent developments in machine learning, digital data collection, and computer infrastructure have made it possible for AI applications to proliferate in sectors that were previously thought to be the exclusive purview of human knowledge [68].The application of AI at the bedside has several advantages, including reliving the workload of healthcare workers, splitting up the workload of healthcare practitioners, replacing repetitive and requiring little cognitive tasks, and augmenting clinical practice.However, the practical ability of the AI model at the bedside has several challenges to be addressed [69].The first challenge is data availability, Electronic health records (EHRs) are frequently kept in healthcare facilities and contain an abundance of data about a patient's medical history.The information might not be standardized, fragmented, or in a format that AI models can simply process.In addition, using personal data raises a number of privacy and ethical issues, and different jurisdictions might have varying standards for data access [70,71].The lack of interpretability and transparency presents another difficulty when using AI models in clinical settings.Clinicians may find it challenging to comprehend how AI models generate their recommendations due to their complexity.Because of this model's lack of interpretability, doctors may be wary of it and be less likely to trust these tools in actual clinical settings [72,73].Clinical practice might be revolutionized by AI models, which could also lead to better patient outcomes.Their practical applicability and accompanying difficulties must be carefully considered in order for them to be successfully incorporated into routine clinical practice.To overcome these obstacles, it is necessary for physicians, researchers, and developers to work together to build AI models that are compatible with clinical procedures and enhance patient safety [68].
With the advent of web applications for AI models, there is now a great opportunity to expand the reach and impact of AI in medicine.Web applications of AI models have several benefits for healthcare workers, these are: faster and more accurate diagnosis, improved efficiency and workflow, better disease prevention and management, personalized healthcare, and enhanced medical research.These can significantly improve the accuracy and speed of diagnoses, provide personalized medical care, and enhance research efforts.In addition, they can help reduce costs, optimize workflow, and improve patient outcomes [74][75][76].In this study, we have built a web application with the aim of bedside utilization, which can save some effort and bring benefits to clinical practice.
The developed prediction model offers valuable insights that can be integrated into existing clinical workflows and decision-making processes.By leveraging the model's predictions to personalize care planning, allocate resources effectively, stratify patient risks, enhance discharge planning, and support clinical decision-making, healthcare providers can optimize patient outcomes, improve resource utilization, and enhance the overall quality of care delivery for tuberculosis spondylitis patients undergoing surgery.

Limitations
It is important to note the limitations of this study.First off, the study's sample size was quite small, which would restrict how broadly the results can be applied.A larger and more diverse sample would enhance the reliability and external validity of the results.Second, the quality and availability of the data collected could have influenced the accuracy and robustness of the predictive model.Retrospective or incomplete data may introduce biases or affect the model's performance.Third, external validation of the model was not conducted using an independent dataset.Without external validation, it is challenging to ascertain the model's applicability to different populations or healthcare settings.Fourth, the selection of variables in the model may be incomplete.It is possible that other relevant variables were overlooked, and including them could improve the predictive power of the model.Fifth, due to retrospective study from one -single institution, selection bias, institutional bias, and patient diversity exist in our study, further large sample, multicenter studies need to tackle these problems.Sixth, missing data could lead to bias, reduced sample size and imputation uncertainty, in future research, we aim to further explore the impact of missing data.Lastly, while the study employed SHAP and LIME techniques to interpret the variables' contributions to the predicted outcomes, machine learning models are inherently black-box, making it challenging to fully understand the underlying mechanisms and causal relationships.To fully comprehend the causes causing prolonged hospital admissions in patients with TB spondylitis, more studies may be required.It is critical to recognize these limitations, because they open up possibilities for future study and might lead to advancements in the creation of prediction models for prolonged hospital stays in TB spondylitis patients.These limitations highlight opportunities for future research to address these issues and enhance the development of predictive models in this context.

Conclusion
In this study, we utilized explainable artificial intelligence (XAI) techniques, such as SHAP and LIME, to predict prolonged hospital stays after surgery for tuberculosis spondylitis patients.Furthermore, the unbalanced data was the deal was manipulated via SMOTE technique.The model incorporated preoperative clinical features and examination results to forecast the duration of hospitalization, resulting in high accuracy and predictive capability.This model offers crucial decision support for healthcare professionals and insights to improve surgical outcomes and reduce hospital stay duration for these patients with the online web application.

Fig. 4 Fig. 5
Fig. 4 Bar plot of comparison of nested and non-nested cross-validation approach

Fig. 6
Fig. 6 ROC with tenfold cross-validation.Receiver operating characteristic curve, illustrating the model's trade-off between sensitivity and the rate of false positives in light of the range of expected probabilities of future prolonged PLOS.The area under the curve shows how well the model can predict whether future prolonged PLOS will develop or not.A random estimator's reference value is 0.5, while the value of a perfect model would be 1.0 (shown by the dotted line)

Fig. 7
Fig. 7 Calibration plots were generated for both the training set (A) and test set (B).These plots illustrate the disparity between the model's predicted probability of prolonged length of stay after surgery and the actual frequency observed in the study sample.The dashed line represents the ideal calibration of the model.The results indicate a close resemblance between the predicted and observed probabilities.Decision curve analysis curves were plotted for both the training set (C) and the testing set (B). Positioned in the upper right corner, the decision curve analysis curve clearly demonstrates the considerable net benefits offered by the predictive model

Fig. 9
Fig. 9 List of variables ranked by their importance to prolonged LOS with different combinations of model and importance calculation algorithm

Fig. 11
Fig. 11 Screenshots of online free web application user interface

Table 1
Baseline characteristics of patients